Title: A Voice Activity Detection Model Composed of Bidirectional LSTM and Attention Mechanism
Authors: Yeonguk Yu, Yoonjoong Kim
DOI: 10.1109/HNICEM.2018.8666342
Venue: 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM)
Published: 2018-11-01
URL: https://doi.org/10.1109/HNICEM.2018.8666342
Citations: 4
Abstract
In this study, we propose a deep learning model consisting of a bidirectional Long Short-Term Memory (bi-LSTM) network and an attention mechanism to perform frame-wise Voice Activity Detection (VAD). The bi-LSTM extracts an annotation for each frame by summarizing information from both directions. The attention mechanism takes these annotations, extracts the frames that are important to the voice-activity judgement, and aggregates the representations of those informative frames into an attention distribution vector. This vector is used as the feature for frame classification with a logistic classifier. We constructed four comparative models and performed experiments with the TIMIT corpus and noise signals. The experiments show that the proposed model outperforms a conventional LSTM-based VAD. We also show how the attention mechanism helps the VAD task by visualizing the model's attention distribution.
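The pipeline the abstract describes (frame annotations → attention distribution → attention-weighted feature → logistic classification) can be sketched as below. This is a minimal NumPy illustration under assumed shapes and parameter names (`w_att`, `w_cls`, `b_cls` are hypothetical placeholders for learned weights), not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_vad_sketch(annotations, w_att, w_cls, b_cls):
    """Sketch of attention-based VAD scoring.

    annotations: (T, d) array, one bi-LSTM output (annotation) per frame.
    w_att: (d,) hypothetical learned attention query vector.
    w_cls, b_cls: hypothetical logistic-classifier weights and bias.
    Returns the attention distribution over frames and a speech probability.
    """
    scores = annotations @ w_att      # (T,) relevance score per frame
    alpha = softmax(scores)           # attention distribution over frames
    context = alpha @ annotations     # (d,) attention-weighted summary vector
    logit = context @ w_cls + b_cls   # logistic classification of the vector
    prob = 1.0 / (1.0 + np.exp(-logit))
    return alpha, prob

# Usage with random stand-in values:
rng = np.random.default_rng(0)
T, d = 8, 4
alpha, prob = attention_vad_sketch(
    rng.standard_normal((T, d)),
    rng.standard_normal(d),
    rng.standard_normal(d),
    0.0,
)
```

The attention weights `alpha` sum to one over the frames, which is what makes the distribution directly visualizable, as the paper does to show which frames drive the voice-activity judgement.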