{"title":"End-to-end Visual-guided Audio Source Separation with Enhanced Losses","authors":"D. Pham, Quang-Anh Do, Thanh Thi Hien Duong, Thi-Lan Le, Phi-Le Nguyen","doi":"10.23919/APSIPAASC55919.2022.9980162","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980162","url":null,"abstract":"Visual-guided Audio Source Separation (VASS) refers to separating individual sound sources from an audio mixture of multiple simultaneous sound sources by using additional visual features that guide the separation process. For the VASS task, visual features and the correlation of audio and visual play an important role, based on which we manage to estimate better audio masks to improve the separation performance. In this paper, we propose an approach to jointly train the components of a cross-modal retrieval framework with video data and enable the network to find more optimal features. Such end-to-end framework is trained with three loss functions: 1) separation loss to limit the separated magnitude spectrogram discrepancy, 2) object-consistency loss to enforce the consistency of the separated audio with the visual information, and 3) cross-modal loss to maximize the correlation of audio and its corresponding visual sounding object while also maximize the difference between the audio and visual information of different objects. The proposed VASS model was evaluated on the benchmark dataset MUSIC, which contains a large number of videos of people playing instruments in different combinations. Experiment results confirmed the advantages of our model over previous VASS models.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"93 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126137504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continuous authentication for smartphones using face images and touch-screen operation","authors":"Shuto Kinoshita, Yuka Watanabe, Y. Yamazaki","doi":"10.23919/APSIPAASC55919.2022.9980045","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980045","url":null,"abstract":"Conventional user authentication methods for smartphones using PINs, passwords, pattern locks, etc. have a problem in that user authentication is not performed continuously after the first authentication success; therefore, there is a risk that an authenticated smartphone might be used improperly by unauthorized individuals. We propose a novel continuous authentication method for smartphones that uses face images and touch-screen operation and evaluated its effectiveness.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129980432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Deep Proximal-Unfolding Method for Monaural Speech Dereverberation","authors":"Meihuang Wang, Minmin Yuan, Andong Li, C. Zheng, Xiaodong Li","doi":"10.23919/APSIPAASC55919.2022.9979935","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979935","url":null,"abstract":"Speech is often distorted by reverberation in an enclosure when the microphone is placed far away from the speech source, reducing speech quality and intelligibility. Recent years have witnessed the development of deep neural networks, and many deep learning-based methods have been proposed for dereverberation. Most deep learning-based methods remove the reverberation by directly mapping the reverberant speech to target speech, which often lacks adequate interpretability, limiting the performance upper bound. This paper proposes a deep un-folding method with an interpretable network structure. First, the dereverberation problem was reformulated based on maximum posterior criterion, and an iterative optimization algorithm was then devised by using proximal operators. Second, we unfolded the iterative optimization algorithm into multi-stage deep neural network, where each stage corresponded to a specific operation of the iterative procedure. Experiments were conducted on the WSJO-SI84 corpus, and the results on both simulated and real RIRs showed that the proposed model outperformed previous models and achieved state-of-the-art performance in terms of PESQ, ESTOI and frequency-weighted segmental SNR.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122365955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New Methods for Fast Detection for Embedded Cognitive Radio","authors":"Grégoire De Broglie, Louis Morge-Rollet, D. L. Jeune, F. Roy, C. Roland, Charles Canaff, J. Diguet","doi":"10.23919/APSIPAASC55919.2022.9980109","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980109","url":null,"abstract":"Spectrum Sensing is an important part of Cognitive Radio (CR) process. It can be used to determine if a Primary User (PU) (i.e. a licensed user) is emitting or not in the communication channel. This paper presents and compares three types of FFT-based detection algorithms for LTE-Advanced (LTE-A) cellular network at Orthogonal Frequency Division Multiple Access (OFDMA) level. These detectors sense the usage of the minimum time-frequency called Resource Block (RB). They are also low latency detectors and they only need one particular Orthogonal Frequency Division Multiplexing (OFDM) symbol to detect the usage of one RB. The three new detectors are based respectively on energy, correlation, and one what will be called eogration which combines energy and correlation. We analyze them with the Fisher's ratio and simulations of hypothesis test. The computing complexity of these detectors is also theoretically analyzed to provide guidance for future implementations.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130550041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effect of Noise on the Perceptual Contribution of Cochlea-Scaled Entropy and Speech Level in Mandarin Sentence Understanding","authors":"Weikang Wu, Shangdi Liao, Fei Chen","doi":"10.23919/APSIPAASC55919.2022.9979873","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979873","url":null,"abstract":"Many studies investigated the impact of various speech segments to speech intelligibility in order to identify important information-bearing regions for the design of new speech processing methods, e.g., speech enhancement. Early findings suggested that cochlea-scaled entropy (CSE) and speech level were important indicators accounting for speech intelligibility in quiet condition. This study further compared the perceptual contributions of CSE and speech level under noisy conditions. Mandarin sentences were masked by steady-state noise and two-talker babble, edited to generate high-entropy-only and high-level-only stimuli, preserving segments with the largest CSEs and the highest levels in clean sentences respectively and replacing the rest with noise, and played to normal-hearing listeners to recognize. Results showed that high-entropy-only stimuli were more intelligible than high-level-only stimuli under noisy conditions. This intelligibility benefit may be attributed to the amount of vowel-consonant transitions, and not to differences in effective signal-to-noise ratios, between the two types of stimuli.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132895723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel Smart Sectoring and Beam Designs in mmWave Broadcast Channels","authors":"Yang He, S. Tsai, Jen-Ming Wu","doi":"10.23919/APSIPAASC55919.2022.9979922","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979922","url":null,"abstract":"This work proposes a smart sectoring scheme for the mm Wave broadcast systems to enhance the throughput by overcoming the inefficient power consumption due to transmit-ting power to undesired user directions in traditional sectoring systems. We optimize the beam pattern of the proposed scheme so that multiple users can be simultaneously served by only one RF chain. As a result, the hardware complexity can be greatly reduced. Simulation results show that the proposed sectoring scheme significantly outperforms the traditional ones under the same numbers of RF chains and antennas. In addition, the advantage of the proposed scheme is also revealed in the complexity. That is, even with only one RF chain, the proposed system can still achieve the close performance of the traditional systems with multiple RF chains (benchmark).","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131938783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Study on Low-Latency Recognition-Synthesis-Based Any-to-One Voice Conversion","authors":"Yi-Yang Ding, Li-Juan Liu, Yu Hu, Zhenhua Ling","doi":"10.23919/APSIPAASC55919.2022.9980091","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980091","url":null,"abstract":"Some application scenarios of voice conversion, such as identity disguise in voice communication, require low-latency generation of converted speech. In traditional conversion methods, both history and future information in input speech are utilized to predict the converted acoustic features at each frame, which leads to long latency of voice conversion. Therefore, this paper proposes a low-latency recognition-synthesis-based any-to-one voice conversion method. Bottleneck (BN) features are extracted by an automatic speech recognition (ASR) acoustic model for frame-by-frame phoneme classification. A minimum mutual information (MMI) loss is introduced to reduce the speaker information in BNs caused by the low-latency configuration. The BN features are sent into a speaker-dependent low-latency LSTM-based acoustic feature predictor and the speech waveforms are reconstructed by an LPCNet vocoder from predicted acoustic features. The total latency of our proposed voice conversion method is 190ms, which is less than the delay requirement for comfortable communication in ITU-T G.114. The naturalness of converted speech is comparable with the upper-bound model trained without low-latency constraints.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122274505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flow-Based Variational Sequence Autoencoder","authors":"Jen-Tzung Chien, Tien-Ching Luo","doi":"10.23919/APSIPAASC55919.2022.9979970","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979970","url":null,"abstract":"Posterior collapse, also known as the Kullback-Leibler (KL) vanishing, is a long-standing problem in variational recurrent autoencoder (VRAE) which is essentially developed for sequence generation. To alleviate the vanishing problem, a complicated latent variable is required instead of assuming it as standard Gaussian. Normalizing flow was proposed to build the bijective neural network which converts a simple distribution into a complex distribution. The resulting approximate posterior is closer to real posterior for better sequence generation. The KL divergence in learning objective is accordingly preserved to enrich the capability of generating the diverse sequences. This paper presents the flow-based VRAE to build the disentangled latent representation for sequence generation. KL preserving flows are exploited for conditional VRAE and evaluated for text representation as well as dialogue generation. In the im-plementation, the schemes of amortized regularization and skip connection are further imposed to strengthen the embedding and prediction. Experiments on different tasks show the merit of this latent variable representation for language modeling, sentiment classification and dialogue generation.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114068097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dialect-aware Semi-supervised Learning for End-to-End Multi-dialect Speech Recognition","authors":"Sayaka Shiota, Ryo Imaizumi, Ryo Masumura, H. Kiya","doi":"10.23919/APSIPAASC55919.2022.9980139","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980139","url":null,"abstract":"In this paper, we propose dialect-aware semi- supervised learning for end-to-end automatic speech recognition (ASR) models considering multi-dialect speech. Some multi- domain ASR tasks require a large amount of training data containing additional information (e.g., language or dialect), whereas it is difficult to prepare such data with accurate transcriptions. Semi-supervised learning is a method of using a massive amount of untranscribed data effectively, and it can be applied to multi-domain ASR tasks to relax the missing transcriptions problem. However, semi-supervised learning has usually used generated pseudo-transcriptions only. The problem is that simply combining a multi-domain model with semi- supervised learning makes use of no additional information even though the information can be obtained. Therefore, in this paper, we focus on semi-supervised learning based on a multi-domain model that takes additional domain information into account. Since the accuracy of pseudo-transcriptions can be improved by using the multi-domain model and additional information, our proposed semi-supervised learning is expected to provide a reliable ASR model. In experiments, we performed Japanese multi-dialect ASR as one type of multi-domain ASR. From the results, a model trained with the proposed method yielded the lowest character error rate compared with other models trained with the conventional semi-supervised method.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115266795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Consistency Training with Hierarchical Temporal Aggregation for Sound Event Detection","authors":"Yunlong Li, Xiujuan Zhu, Mingyu Wang, Ying Hu","doi":"10.23919/APSIPAASC55919.2022.9980285","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980285","url":null,"abstract":"In this paper, we propose a sound event detection (SED) method based on the self-consistency training (SCT) strategy and a hierarchical temporal aggregation (HTA) module, named SCT-HTA. This method adopts Mean Teacher (MT) semi-supervised learning method, exploiting a dual-branch convolutional recurrent neural network (CRNN) structure including the main branch and auxiliary branch. We adopt an SCT strategy to apply the self-consistency regularization in addition to the MT loss to maintain the consistency between the outputs of the auxiliary and main branches. Furthermore, an HTA module is designed to aggregate the information at different temporal resolutions. We also explored three aggregators to be applied in the HTA module and four kinds of combinations of pooling methods in the localization modules of two branches. Experimental results demonstrate that our proposed SCT-HTA method outperforms the four compared methods. The results show that the max pooling aggregator has a better ability to highlight the location of sound events. And the “linear softmax + attention” combination of the pooling method achieves the best performance.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121025782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}