MarbleNet: A Deep Neural Network Solution for Vietnamese Voice Activity Detection
Authors: Hoang-Thi Nguyen-Vo, Huy Nguyen-Gia, Hoan-Duy Nguyen-Tran, Hoang Pham-Minh, Hung Vo-Thanh, Hao Do-Duc
Published in: 2022 9th NAFOSTED Conference on Information and Computer Science (NICS), 2022-10-31
DOI: 10.1109/NICS56915.2022.10013457 (https://doi.org/10.1109/NICS56915.2022.10013457)
Citations: 0
Abstract
Voice activity detection in the wild is considered challenging, especially for the Vietnamese language, where many proposed approaches are not extensive enough. In this paper, we address this problem with MarbleNet, a model that builds on earlier successful applications of 1D CNNs to conventional problems. We compiled a dataset that combines the VIVOS dataset for speech labelling with audio clips collected from Freesound.org for background noise. We report the performance of MarbleNet on this dataset and run experiments comparing MarbleNet with two other CNN-based architectures to measure the efficiency of our solution. The experiments show that MarbleNet, despite its smaller size, can outperform the other CNN-based models in clean and many noisy environments.
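
The abstract links MarbleNet's compact size to its 1D CNN design but gives no layer details. As a rough illustration, the sketch below assumes the time-channel separable 1D convolutions described in the original MarbleNet work (an assumption, not something stated in this abstract) and compares their weight count with that of a standard 1D convolution. The layer sizes are hypothetical and chosen only for illustration.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a standard 1D convolution (bias omitted)."""
    return kernel * c_in * c_out


def separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weight count of a depthwise convolution over time (one filter per
    input channel) followed by a pointwise 1x1 convolution across channels."""
    depthwise = kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer sizes, not taken from the paper.
C_IN, C_OUT, KERNEL = 128, 128, 13

standard = conv1d_params(C_IN, C_OUT, KERNEL)
separable = separable_conv1d_params(C_IN, C_OUT, KERNEL)
print(f"standard: {standard}, separable: {separable}")
```

With these illustrative sizes the separable layer needs roughly a twelfth of the weights, which is the kind of saving that lets a 1D CNN stay small; the actual ratio depends on the real channel counts and kernel widths.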