基于双向LSTM和注意机制的语音活动检测模型

2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology,Communication and Control, Environment and Management (HNICEM) Pub Date : 2018-11-01 DOI:10.1109/HNICEM.2018.8666342

Yeonguk Yu, Yoonjoong Kim

{"title":"基于双向LSTM和注意机制的语音活动检测模型","authors":"Yeonguk Yu, Yoonjoong Kim","doi":"10.1109/HNICEM.2018.8666342","DOIUrl":null,"url":null,"abstract":"In this study, we proposed a deep learning model that consists of the bidirectional Long-Short Term Memory (bi-LSTM) and the attention mechanism to perform frame-wise Voice Activity Detection (VAD). The bi-LSTM extracts annotations of frame by summarizing information from both direction. The attention mechanism accepts the annotations to extracts such frames that are important to the voice activity judgement and aggregates the representation of those informative frames to form an attention distribution vector. It is used as features for frame classification by logistic classification approach. We constructed four comparative models to perform experiments with TIMIT corpus and noise signals. The excrement shows that the proposed model outperforms the conventional VAD with LSTM. And we showed how the attention mechanism can help VAD tasks by visualizing the attention distribution of the model.","PeriodicalId":426103,"journal":{"name":"2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology,Communication and Control, Environment and Management (HNICEM)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"A Voice Activity Detection Model Composed of Bidirectional LSTM and Attention Mechanism\",\"authors\":\"Yeonguk Yu, Yoonjoong Kim\",\"doi\":\"10.1109/HNICEM.2018.8666342\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this study, we proposed a deep learning model that consists of the bidirectional Long-Short Term Memory (bi-LSTM) and the attention mechanism to perform frame-wise Voice Activity Detection (VAD). The bi-LSTM extracts annotations of frame by summarizing information from both direction. The attention mechanism accepts the annotations to extracts such frames that are important to the voice activity judgement and aggregates the representation of those informative frames to form an attention distribution vector. It is used as features for frame classification by logistic classification approach. We constructed four comparative models to perform experiments with TIMIT corpus and noise signals. The excrement shows that the proposed model outperforms the conventional VAD with LSTM. And we showed how the attention mechanism can help VAD tasks by visualizing the attention distribution of the model.\",\"PeriodicalId\":426103,\"journal\":{\"name\":\"2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology,Communication and Control, Environment and Management (HNICEM)\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology,Communication and Control, Environment and Management (HNICEM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HNICEM.2018.8666342\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology,Communication and Control, Environment and Management (HNICEM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HNICEM.2018.8666342","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

在这项研究中，我们提出了一个由双向长短期记忆(bi-LSTM)和注意机制组成的深度学习模型来执行分帧语音活动检测(VAD)。bi-LSTM通过对两个方向的信息进行汇总来提取帧的注释。注意机制接受注释提取对语音活动判断有重要意义的帧，并将这些信息帧的表示聚合形成注意分布向量。将其作为特征，采用逻辑分类方法对框架进行分类。我们构建了四个比较模型，在TIMIT语料库和噪声信号下进行了实验。实验结果表明，该模型优于基于LSTM的传统VAD。我们展示了注意力机制如何通过可视化模型的注意力分布来帮助VAD任务。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Voice Activity Detection Model Composed of Bidirectional LSTM and Attention Mechanism

In this study, we proposed a deep learning model that consists of the bidirectional Long-Short Term Memory (bi-LSTM) and the attention mechanism to perform frame-wise Voice Activity Detection (VAD). The bi-LSTM extracts annotations of frame by summarizing information from both direction. The attention mechanism accepts the annotations to extracts such frames that are important to the voice activity judgement and aggregates the representation of those informative frames to form an attention distribution vector. It is used as features for frame classification by logistic classification approach. We constructed four comparative models to perform experiments with TIMIT corpus and noise signals. The excrement shows that the proposed model outperforms the conventional VAD with LSTM. And we showed how the attention mechanism can help VAD tasks by visualizing the attention distribution of the model.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology,Communication and Control, Environment and Management (HNICEM)

自引率

0.00%

发文量