{"title":"Direction-aware target speaker extraction with a dual-channel system based on conditional variational autoencoders under underdetermined conditions","authors":"Rui Wang, Li Li, T. Toda","doi":"10.23919/APSIPAASC55919.2022.9979881","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979881","url":null,"abstract":"In this paper, we deal with a dual-channel target speaker extraction (TSE) problem under underdetermined con-ditions. For the dual-channel system, the generalized sidelobe canceller (GSC) is a commonly used structure for estimating a blocking matrix (BM) to generate interference, and geometric source separation (GSS) can be used as an implementation of BM estimation utilizing directional information. However, the performance of the conventional GSS methods is limited under underdetermined conditions because of the lack of a powerful source model. In this paper, we propose a dual-channel TSE method that combines the ability of target selection based on geometric constraints, more powerful source modeling, and nonlinear postprocessing. The target directional information is used as a geometric constraint, and two conditional variational auto encoders (CVAEs) are used to model a single speaker's speech and interference mixture speech. For the postprocessing, an ideal ratio Time-Frequency (T-F) mask estimated from the separated interference mixture speech is used to extract the target speaker's speech. The experimental results demonstrate that the proposed method achieves 6.24 dB and 8.37 dB improvements compared with the baseline method in terms of signal-to-distortions ratio (SDR) and source-to-interferences ratio (SIR) respectively under strong reverberation for 470 ms.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116382658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DCAN: Deep Consecutive Attention Network for Video Super Resolution","authors":"Talha Saleem, Sovann Chen, S. Aramvith","doi":"10.23919/APSIPAASC55919.2022.9979823","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979823","url":null,"abstract":"Slow motion is visually attractive in video applications and gets more attention in video super-resolution (VSR). To generate the high-resolution (HR) center frame with its neighbor HR frames from the low-resolution (LR) of two frames. Two sub-tasks are required, including video super-resolution (VSR) and video frame interpolation (VFI). However, the interpolation approach does not successfully extract low-level features to achieve the acceptable result of space-time video super-resolution. Therefore, the restoration performance of existing systems is constrained due to rarely considering the spatial-temporal correlation and the long-term temporal context concurrently. To this extent, we propose a deep consecutive attention network-based method to generate attentive features to get HR slow-motion frames. A channel attention module and an attentive temporal feature module are designed to improve the perceptual quality of predicted interpolation feature frames. The experimental results show the proposed method outperforms 0.17 dB in an average PSNR compared to the state-of-the-art baseline method.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115578252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Encrypted JPEG Image Retrieval via Huffman-code Based Self-Attention Networks","authors":"Zhixun Lu, Qihua Feng, Peiya Li","doi":"10.23919/APSIPAASC55919.2022.9979814","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979814","url":null,"abstract":"Image retrieval has been widely used in daily life. In recent years, with the increasing awareness of privacy protection, encrypted image retrieval has also been gradually developed. In this paper, we propose a new encrypted JPEG image retrieval scheme, named Huffman-code Based Self-Attention Networks (HBSAN), which could conduct image retrieval and protect image privacy effectively. To be specific, we first extract Huffman-code histograms directly from cipher-images which are encrypted by jointly using new orthogonal transformation, permutation cipher and stream cipher during JPEG compression. Then we employ the self-attention neural networks to mine the deep relations and retrieve the cipher-images. In our retrieval model, we design a self-attention multi-layer perceptron module, called SAMLP, to effectively learn global dependencies within representations of cipher-images. Extensive experiments present our encryption algorithm is compression-friendly, ensures no information leakage, and HBSAN significantly outperforms other state-of-the-art models in retrieval performance.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115597106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-branch Learning for Noisy and Reverberant Monaural Speech Separation","authors":"Chao Ma, Dongmei Li","doi":"10.23919/APSIPAASC55919.2022.9980244","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980244","url":null,"abstract":"With the rapid development of deep learning approaches, much progress has been made on speech enhancement, speech dereverberation, and monaural multi- speaker speech separation to solve the cocktail problem. Some excellent methods have been proposed to solve the monaural speech separation in a noisy and reverberant environment. However, few studies exploit the correlations between anechoic speech and reverberant speech. In this work, the structure of a popular separation system is deconstructed, and a multi-branch learning method is proposed to enforce the network to exploit the correlations between anechoic speech and the corresponding reverberant speech. The results show that using multi-branch learning can improve the separation performance of different networks by 0.7dB with conv-tasnet on the WHAMR! dataset.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"58 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114126699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-resolution GPR clutter suppression method based on low-rank and sparse decomposition","authors":"Yanjie Cao, Xiaopeng Yang, T. Lan","doi":"10.23919/APSIPAASC55919.2022.9980215","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980215","url":null,"abstract":"The clutter encountered in ground-penetrating radar (GPR) seriously affects the detection and identification for the subsurface target, which has been widely studied in recent years. A low-rank and sparse decomposition (LRSD) method with multi-resolution is introduced in this paper. First, the raw GPR data is decomposed by stationary wavelet transform (SWT) to obtain different sub-bands. Then, the robust non-negative matrix factorization (RNMF) is used for approximation sub-bands and horizontal wavelet sub-bands to extract the target sparse parts. Next, the wavelet soft threshold de-noising is used for the vertical and diagonal wavelet sub-bands. Finally, the inverse wavelet transform of processed sub-bands is performed to reconstruct the target signal. The proposed method is compared with the subspace method and LRSD methods on both simulation data and real collected data. Visual and quantitative results show that the proposed method has better clutter suppression performance.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115224478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine-Tuning BERT for Question and Answering Using PubMed Abstract Dataset","authors":"Saeyeon Cheon, Insung Ahn","doi":"10.23919/APSIPAASC55919.2022.9980097","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980097","url":null,"abstract":"The coronavirus, which first originated in China in 2019, spread worldwide and eventually reached a pandemic situation. In the interest of many people, misinformation about the coronavirus has been pouring out on the Internet. We developed a Q&A processing technique by building a dataset based on the PubMed paper abstract for people to easily get the right information. We fine-tuned BioBERT among the BERT models that reached SOTA performance in the biomedical Q&A task. It answered questions about coronavirus with high accuracy. In the future, we will develop our technology that can handle Q&A not only in English but also in multiple languages. This work will contribute to helping people who speak different languages easily obtain correct information amidst confusing data.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116169326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parameterization of Dominant Spectral Peak Trajectory for Whisper Speech Recognition","authors":"Chang Feng, Xiaolong Wu, Mingxing Xu, T. Zheng","doi":"10.23919/APSIPAASC55919.2022.9980259","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980259","url":null,"abstract":"Automatic speech recognition (ASR) systems trained on normal speech generally suffer from performance degradations for whisper speech. To solve this problem, this paper concentrates on utilizing similar factors between normal and whisper speech to construct a whisper speech recognizer with normal speech data. We propose to parameterize the dominant spectral peak trajectory (Ppeak) to capture the similarities and concatenate it to the traditional Mel-Frequency Cepstral Coefficients (MFCC) and Human Factor Cepstral Coefficients (HFCC), respectively, to form new features. The proposed features benefit to the accuracy of whisper speech recognition. Performance improvement can be further achieved when the similarity is enhanced by removing low frequency information. Experimental results show that the performance degradation between match and mismatch scenarios was reduced relatively by 90.31% in Word Error Rate (WER) for HFCC after similarity enhancement at a cut-off frequency of 500Hz. Furthermore, we ultimately achieved a relative reduction of 69.60% in WER in the mismatch scenario compared with conventional MFCC even without whisper speech data for training.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114250363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correlation Loss for MOS Prediction of Synthetic Speech","authors":"Beibei Hu, Qiang Li","doi":"10.23919/APSIPAASC55919.2022.9980182","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980182","url":null,"abstract":"For the speech mean opinion score (MOS) prediction task, many deep-learning-based methods are developed. Generally, system-level and utterance-level mean squared error (MSE), Linear Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall Tau Rank Correlation (KTAU) are leveraged as the evaluation metrics. However, we find that the objective functions for many MOS prediction networks are MAE or MSE based without an explicit correlation objective part. This paper investigates different correlation losses for voice MOS prediction networks. Based on the datasets and SSL-MOS baseline system provided by VoiceMOsChallenge 2022, we employ different auxiliary correlation losses to train the MOS prediction network. The experiment results show that the suggested auxiliary correlation losses increase the performance of the SSL-MOS network on the six correlation metrics. Compared with the two best-performing systems in the VoiceMOsChallenge 2022, our approach achieves close performance on the system-level correlation metrics with simpler system architecture.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114512378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Restoring Edge and Color using Weighted Near-Infrared Image and Color Transmission Maps for Robust Haze Removal","authors":"Onhi Kato, Akira Kubota","doi":"10.23919/APSIPAASC55919.2022.9979960","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979960","url":null,"abstract":"In recent years, various haze removal methods based on atmospheric scattering models have been proposed. Most methods target strong haze images in which light is scattered equally in all color channels. This paper proposes a haze removal method using near-infrared (NIR) images for weak haze images. The proposed method first restores the edges of color images by fusing weighted NIR images. Second, it estimates transmission maps for all color channels based on a wavelength-dependent scattering model and restores the color of the edge-restored image using the estimated transmission maps. Finally, the edge- restored and color-restored images are blended. Qualitative and quantitative evaluations demonstrate that the proposed method can restore edges and colors more naturally in weak haze images than conventional methods.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114604464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Camera-Based Log System for Human Physical Distance Tracking in Classroom","authors":"S. Deepaisarn, Angkoon Angkoonsawaengsuk, Charn Arunkit, Chayud Srisumarnk, Krongkan Nimmanwatthana, Nanmanas Linphrachaya, Nattapol Chiewnawintawat, Rinrada Tanthanathewin, Sivakorn Seinglek, Suphachok Buaruk, Virach Sornlertlamvanich","doi":"10.23919/APSIPAASC55919.2022.9980055","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980055","url":null,"abstract":"In the pandemic of COVID-19, the indoor physical distancing protocol has been one of the recommendations for people to avoid close contact with each other in order to prevent contagious clusters. This paper proposes an end-to-end camera-based human physical distancing recording system for an indoor environment, specifically, a classroom. The recording system aims to automatically trace the locations of persons and the directions of their movements in a classroom, also with respect to the on- and off-seat activities. No identity of persons is kept in the recording log system, but locations of individual persons at each timestamp are obtained; hence, the spatial and temporal distribution can be studied further. In this paper, we illustrate the overview workflow of the human and seat detection as well as the log system storing human physical distancing actions.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114899944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}