{"title":"Masked cross self-attentive encoding based speaker embedding for speaker verification","authors":"Soonshin Seo, Ji-Hwan Kim","doi":"10.7776/ASK.2020.39.5.497","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.497","url":null,"abstract":"Constructing speaker embeddings for speaker verification is an important issue. In general, a self-attention mechanism has been applied for speaker embedding encoding. Previous studies focused on training self-attention in a high-level layer, such as the last pooling layer, so the effect of low-level layers is not well represented in the speaker embedding encoding. In this study, we propose Masked Cross Self-Attentive Encoding (MCSAE) using ResNet, which focuses on training the features of both high-level and low-level layers. Based on multi-layer aggregation, the output features of each residual layer are used for the MCSAE. In the MCSAE, the interdependence of the input features is trained by a cross self-attention module, and a random masking regularization module is applied to prevent overfitting. The MCSAE enhances the weight of frames representing speaker information. The output features are then concatenated and encoded into the speaker embedding, yielding a more informative speaker embedding. The experimental results showed an equal error rate of 2.63 % on the VoxCeleb1 evaluation dataset, an improvement over previous self-attentive encoding and state-of-the-art methods.","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"497-504"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47000827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Triplet loss based domain adversarial training for robust wake-up word detection in noisy environments","authors":"Hyungjun Lim, Myunghun Jung, Hoirin Kim","doi":"10.7776/ASK.2020.39.5.468","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.468","url":null,"abstract":"A good acoustic word embedding that can well express the characteristics of a word plays an important role in wake-up word detection (WWD). However, the representation ability of an acoustic word embedding may be weakened by the various types of environmental noise present where WWD operates, causing performance degradation. In this paper, we propose triplet-loss-based Domain Adversarial Training (tDAT), which mitigates environmental factors that can affect the acoustic word embedding. Through experiments in noisy environments, we verify that the proposed method effectively improves the conventional DAT approach, and we check its scalability by combining it with another method proposed for robust WWD.","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"468-475"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47175818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance comparison of wake-up-word detection on mobile devices using various convolutional neural networks","authors":"Sangho Lee","doi":"10.7776/ASK.2020.39.5.454","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.454","url":null,"abstract":"Artificial intelligence assistants that provide speech recognition operate through cloud-based voice recognition with high accuracy. In cloud-based speech recognition, Wake-Up-Word (WUW) detection plays an important role in activating devices on standby. In this paper, we compare the performance of Convolutional Neural Network (CNN)-based WUW detection models for mobile devices on Google's speech commands dataset, using spectrogram and mel-frequency cepstral coefficient features as inputs. The models compared are a multi-layer perceptron, a general convolutional neural network, VGG16, VGG19, ResNet50, ResNet101, ResNet152, and MobileNet. We also propose a network that reduces the model size to 1/25 of MobileNet's while maintaining its performance.","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"454-460"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46347014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"α-feature map scaling for raw waveform speaker verification","authors":"Jee-weon Jung, Hye-jin Shim, Ju-ho Kim, Ha-jin Yu","doi":"10.7776/ASK.2020.39.5.441","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.441","url":null,"abstract":"","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"441-446"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46488187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acoustic model training using self-attention for low-resource speech recognition","authors":"Hosung Kim","doi":"10.7776/ASK.2020.39.5.483","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.483","url":null,"abstract":"This paper proposes acoustic model training using self-attention for low-resource speech recognition. In low-resource speech recognition, it is difficult for the acoustic model to distinguish certain phones, for example the plosives /d/ and /t/, the plosives /g/ and /k/, and the affricates /z/ and /ch/. In acoustic model training, self-attention generates attention weights from the deep neural network model. In this study, these weights handle the similar-pronunciation errors in low-resource speech recognition. When the proposed method was applied to a Time Delay Neural Network-Output gate Projected Gated Recurrent Unit (TDNN-OPGRU)-based acoustic model, the proposed model showed a 5.98 % word error rate, an absolute improvement of 0.74 % over the baseline TDNN-OPGRU model.","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"483-489"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42189624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Absolute sound level algorithm for contents platform","authors":"Du-Heon Gyeon","doi":"10.7776/ASK.2020.39.5.424","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.424","url":null,"abstract":"This paper describes an algorithm that calculates the Absolute Sound Level (ASL) for content platforms. The ASL is a single volume level representing an individual sound source, a concept designed to integrate the sound level units of the digital source domain with those of the physical domain of a loudspeaker for practical use. For this concept to be used in content platforms and elsewhere, the ASL must be derived automatically, without relying on the hearing of mastering engineers. The key parameters by which a person recognizes the representative sound level of a single sound source are “frequency, maximum energy, energy variation coefficient, and perceived energy distribution,” and the ASL was calculated by normalizing their weights.","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"424-434"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48592183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved speech enhancement of multi-channel Wiener filter using adjustment of principal subspace vector","authors":"Gibak Kim","doi":"10.7776/ASK.2020.39.5.490","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.490","url":null,"abstract":"We present a method to improve the performance of the multi-channel Wiener filter in noisy environments. When building a subspace-based multi-channel Wiener filter for a single target source, the target speech component can be effectively estimated in the principal subspace of the speech correlation matrix. The speech correlation matrix can be estimated by subtracting the noise correlation matrix from the signal correlation matrix, based on the assumption that the cross-correlation between speech and interfering noise is negligible compared with the speech correlation. However, this assumption does not hold in the presence of strong interfering noise, and significant error can accordingly be induced in the principal subspace. In this paper, we propose to adjust the principal subspace vector using the speech presence probability and the steering vector for the desired speech source. The multi-channel speech presence probability is derived in the principal subspace and applied to adjust the principal subspace vector. Simulation results show that the proposed method improves the performance of the multi-channel Wiener filter in noisy environments.","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"490-496"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48121114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"I-vector similarity based speech segmentation for interested speaker to speaker diarization system","authors":"Ara Bae, Ki‑mu Yoon, Jaehong Jung, Bokyung Chung, Wooil Kim","doi":"10.7776/ASK.2020.39.5.461","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.461","url":null,"abstract":"In noisy and multi-speaker environments, speech recognition performance is unavoidably lower than in a clean environment. To improve speech recognition, in this paper, the signal of the speaker of interest is extracted from mixed speech signals containing multiple speakers. The VoiceFilter model is used to effectively separate overlapped speech signals. In this work, clustering by Probabilistic Linear Discriminant Analysis (PLDA) similarity score is employed to detect the speech of the speaker of interest, which is then used as the reference speaker for VoiceFilter-based separation. By utilizing the speaker feature extracted from the speech detected by the proposed clustering method, this paper proposes a speaker diarization system that uses only the mixed speech, without an explicit reference speaker signal. We use a telephone dataset consisting of two speakers to evaluate the performance of the system. The Source-to-Distortion Ratios (SDR) of the operator (Rx) speech and the customer (Tx) speech are 5.22 dB and –5.22 dB, respectively, before separation, and 11.26 dB and 8.53 dB, respectively, with the proposed separation system.","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"461-467"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42501548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of portable single-beam acoustic tweezers for biomedical applications","authors":"Junsu Lee, Yeon-Seong Park, Miji Kim, Changhan Yoon","doi":"10.7776/ASK.2020.39.5.435","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.435","url":null,"abstract":"Single-beam acoustic tweezers, which can manipulate micron-size particles in a non-contact manner, have been used in many biological and biomedical applications. Current single-beam acoustic tweezer systems developed for in vitro experiments consist of a function generator and a power amplifier, so the system is bulky and expensive. This configuration is not suitable for in vivo and clinical applications. In this paper, we therefore present a portable single-beam acoustic tweezer system and its performance in trapping and manipulating micron-size objects. The developed system consists of a Field Programmable Gate Array (FPGA) chip and two pulsers, and parameters such as center frequency and pulse duration are controlled by a Personal Computer (PC) via a USB (Universal Serial Bus) interface in real time. The system was shown to generate transmit pulses at up to 20 MHz and to produce sufficient intensity to trap microparticles and cells. Its performance was evaluated by trapping and manipulating polystyrene particles 40 μm and 90 μm in diameter.","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"435-440"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44370393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal design of impeller in fan motor unit of cordless vacuum cleaner for improving flow performance and reducing aerodynamic noise","authors":"Kunwoo Kim, Seo-Yoon Ryu, C. Cheong, Seongjin Seo, Cheolmin Jang","doi":"10.7776/ASK.2020.39.5.379","DOIUrl":"https://doi.org/10.7776/ASK.2020.39.5.379","url":null,"abstract":"In this study, the flow and noise performance of a high-speed fan motor unit for a cordless vacuum cleaner is improved by optimizing the impeller, which drives the suction air through the flow passage of the cleaner. First, the unsteady incompressible Reynolds-averaged Navier-Stokes (RANS) equations are solved to investigate the flow through the fan motor unit using computational fluid dynamics techniques. Based on the flow field results, the Ffowcs Williams and Hawkings (FW-H) integral equation is used to predict the flow noise radiated from the impeller. The predicted results are compared to measurements, which confirms the validity of the numerical method used. A strong vortex is found to form around the mid-chord region of the main blades, where the blade curvature changes rapidly. Given that this vortex acts as a flow loss and a noise source, the impeller blades are redesigned to suppress it. A two-factor response surface method is employed to determine the optimum inlet and outlet sweep angles for maximum flow rate and minimum noise. Further analysis of the finally selected design confirms the improved flow and noise performance.","PeriodicalId":42689,"journal":{"name":"Journal of the Acoustical Society of Korea","volume":"39 1","pages":"379-389"},"PeriodicalIF":0.4,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45624374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}