DP-DWA: Dual-Path Dynamic Weight Attention Network with Streaming DFSMN-SAN for Automatic Speech Recognition
Dongpeng Ma, Yiwen Wang, Liqiang He, Mingjie Jin, Dan Su, Dong Yu
ICASSP 2022. DOI: 10.1109/icassp43922.2022.9746328

Abstract: In multi-channel far-field automatic speech recognition (ASR) scenarios, front-end processing introduces distortion into the speech signal, which degrades recognition performance. In this paper, we propose a dual-path network for the far-field acoustic model that takes the voice processing (VP) signal and the acoustic echo cancellation (AEC) signal as input. Specifically, we design a dynamic weight attention (DWA) module to combine the two signals. In addition, we streamline our best deep feed-forward sequential memory network with self-attention (DFSMN-SAN) acoustic model to meet real-time requirements. A joint-training strategy is adopted to optimize the proposed approach. With the dual-path network, we achieve a 54.5% relative improvement in character error rate (CER) on a 10,000-hour online conference task. Moreover, the proposed method is unaffected by the microphone-array geometry: we achieve a 23.56% relative improvement on a vehicle task that uses a two-microphone array.
Fre-GAN 2: Fast and Efficient Frequency-Consistent Audio Synthesis
Sang-Hoon Lee, Ji-Hoon Kim, Kangeun Lee, Seong-Whan Lee
ICASSP 2022. DOI: 10.1109/icassp43922.2022.9746675

Abstract: Although recent advances in neural vocoders have brought significant improvements, most of these models trade audio quality against computational complexity. Since large models are impractical on low-resource devices, a more efficient neural vocoder is needed to synthesize high-quality audio. In this paper, we present Fre-GAN 2, a fast and efficient high-quality audio synthesis model. For fast synthesis, Fre-GAN 2 synthesizes only the low- and high-frequency parts of the audio, and we leverage the inverse discrete wavelet transform to reproduce the target-resolution audio in the generator. Additionally, we introduce adversarial periodic feature distillation, which lets the model synthesize high-quality audio with only a small number of parameters. The experimental results show the superiority of Fre-GAN 2 in audio quality. Furthermore, Fre-GAN 2 achieves a 10.91× generation speedup and a 21.23× parameter compression compared with Fre-GAN.
{"title":"Local Context Interaction-Aware Glyph-Vectors for Chinese Sequence Tagging","authors":"Junyu Lu, Pingjian Zhang","doi":"10.1109/icassp43922.2022.9747303","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9747303","url":null,"abstract":"As hieroglyphics, Chinese characters contain rich semantic and glyphs information, which is beneficial to sequence tagging task. However, it’s difficult for shallow CNNs architecture to extract glyphs information from character data and implement the con-textual interaction of different glyphs information effectively. In this paper, we address these issues by presenting LCIN: a Local Context Interaction-aware Network for glyph-vectors extraction. The network utilizes depthwise separable convolution and attention machine to extract glyphs information from images of Chinese characters. Moreover, we interconnect adjacent attention blocks so that glyphs information can flow within the local context. Experiments on three subtasks for sequence tagging show that our method out-performs other glyph-based models and achieves new SOTA results in a wide range of datasets.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129730452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detection of Covid-19 from Joint Time and Frequency Analysis of Speech, Breathing and Cough Audio
John Harvill, Yash R. Wani, Moitreya Chatterjee, M. Alam, D. Beiser, David Chestek, M. Hasegawa-Johnson, N. Ahuja
ICASSP 2022. DOI: 10.1109/icassp43922.2022.9746015

Abstract: The distinct cough sounds produced by a variety of respiratory diseases suggest the potential for a new class of audio biomarkers for the detection of COVID-19. Accurate audio-biomarker-based COVID-19 tests would be inexpensive, readily scalable, and non-invasive, and such screening could also be used in resource-limited settings prior to traditional diagnostic testing. Here we explore the possibility of leveraging three audio modalities (cough, breathing, and speech) to determine COVID-19 status. We train a separate neural classification system on each modality, as well as a fused classification system on all three modalities together. Ablation studies are performed to understand the relationship between the individual and collective performance of the modalities. Additionally, we analyze the extent to which temporal and spectral features contribute to the COVID-19 status information contained in the audio signals.
{"title":"Learning Monocular 3D Human Pose Estimation With Skeletal Interpolation","authors":"Ziyi Chen, A. Sugimoto, S. Lai","doi":"10.1109/icassp43922.2022.9746410","DOIUrl":"https://doi.org/10.1109/icassp43922.2022.9746410","url":null,"abstract":"Deep learning has achieved unprecedented accuracy for monocular 3D human pose estimation. However, current learning-based 3D human pose estimation still suffers from poor generalization. Inspired by skeletal animation, which is popular in game development and animation production, we put forward an simple, intuitive yet effective interpolation-based data augmentation approach to synthesize continuous and diverse 3D human body sequences to enhance model generalization. The Transformer-based lifting network, trained with the augmented data, utilizes the self-attention mechanism to perform 2D-to-3D lifting and successfully infer high-quality predictions in the qualitative experiment. The quantitative result of cross-dataset experiment demonstrates that our resulting model achieves superior generalization accuracy on the publicly available dataset.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"2002 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128282646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Regularization Using Denoising: Exact and Robust Signal Recovery","authors":"Ruturaj G. Gavaskar, K. Chaudhury","doi":"10.1109/ICASSP43922.2022.9747396","DOIUrl":"https://doi.org/10.1109/ICASSP43922.2022.9747396","url":null,"abstract":"We consider the problem of signal reconstruction from linearly corrupted data using plug-and-play (PnP) regularization. As opposed to traditional sparsity-promoting regularizers, PnP uses an off-the-shelf denoiser within a proximal algorithm such as ISTA or ADMM for image reconstruction. Although PnP has become popular in the imaging community, its regularization capacity is not fully understood. For example, it is not known if PnP can in theory recover a signal from few noiseless measurements as in classical compressed sensing and if the recovery is robust. We explore these questions in this work and present some theoretical and experimental results. In particular, we prove that if the denoiser in question has low rank and if the ground- truth lies in the range of the denoiser, then it can be recovered exactly from noiseless measurements. To the best of knowledge, this is first such result. Furthermore, we show using numerical simulations that even if the aforementioned conditions are violated, PnP recovery is robust in practice. We formulate a theorem regarding the recovery error based on these observations.","PeriodicalId":272439,"journal":{"name":"ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129090664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Interpreting Deep Learning Models to Understand Loss of Speech Intelligibility in Speech Disorders Step 2: Contribution of the Emergence of Phonetic Traits
Sondes Abderrazek, C. Fredouille, A. Ghio, M. Lalain, Christine Meunier, V. Woisard
ICASSP 2022. DOI: 10.1109/icassp43922.2022.9746198

Abstract: Beyond the impressive performance deep learning has achieved on many tasks, one of the most important factors for its continued progress is work on interpretability, especially in a medical context. In recent work, we showed that a CNN-based model trained on normal speech achieves competitive performance on French phone classification and correlates well with different perceptual measures when exposed to disordered speech. This paper extends that work by focusing on interpretability. The goal is to gain insight into how neural representations shape the final phone classification task, so that they can then be used to explain the loss of intelligibility in disordered speech. We propose an original framework that relies, first, on neural activity and a novel per-neuron representation with respect to phone classification and, second, on identifying a set of neurons devoted to detecting specific phonetic traits in normal speech. When exposed to disordered speech, this set of neurons degrades, demonstrating the loss of specific phonetic traits in some of the patients involved and the potential of the proposed approach to provide information about speech alteration.
Distributed Particle Filters for State Tracking on the Stiefel Manifold Using Tangent Space Statistics
C. Bordin, Caio Gomes de Figueredo, Marcelo G. S. Bruno
ICASSP 2022. DOI: 10.1109/icassp43922.2022.9746305

Abstract: This paper introduces a novel distributed diffusion algorithm for tracking the state of a dynamic system that evolves on the Stiefel manifold. To compress the information exchanged between nodes, the algorithm builds a Gaussian parametric approximation to the particles, which are first projected onto the tangent space of the Stiefel manifold and mapped to real vectors. Observations from neighboring nodes are then assimilated under a general nonlinear observation model. Performance is compared to that of competing linear-diffusion extended Kalman filters and other particle filters.
Training Strategies for Automatic Song Writing: A Unified Framework Perspective
Tao Qian, Jiatong Shi, Shuai Guo, Peter Wu, Qin Jin
ICASSP 2022. DOI: 10.1109/icassp43922.2022.9746818

Abstract: Automatic song writing (ASW) typically involves four tasks: lyric-to-lyric generation, melody-to-melody generation, lyric-to-melody generation, and melody-to-lyric generation. Previous works have mainly focused on individual tasks without considering the correlation between them, and thus a unified framework to solve all four tasks has not yet been explored. In this paper, we propose a unified framework following the pre-training and fine-tuning paradigm to address all four ASW tasks with one model. To alleviate the data scarcity issue of paired lyric-melody data for lyric-to-melody and melody-to-lyric generation, we adopt two pre-training stages with unpaired data. In addition, we introduce a dual transformation loss to fully utilize paired data in the fine-tuning stage to enforce the weak correlation between melody and lyrics. We also design an objective music generation evaluation metric involving the chromatic rule and a more realistic setting, which removes some strict assumptions adopted in previous works. To the best of our knowledge, this work is the first to explore ASW for pop songs in Chinese. Extensive experiments demonstrate the effectiveness of the dual transformation loss and the unified model structure encompassing all four tasks. The experimental results also show that our proposed new evaluation metric aligns better with subjective opinion scores from human listeners.
Vision Transformer-Based Retina Vessel Segmentation with Deep Adaptive Gamma Correction
Hyunwoo Yu, J. Shim, Jaeho Kwak, J. Song, Suk-Ju Kang
ICASSP 2022. DOI: 10.1109/icassp43922.2022.9747597

Abstract: Accurate segmentation of retina vessels is essential for the early diagnosis of eye-related diseases. Recently, convolutional neural networks have shown remarkable performance in retina vessel segmentation. However, the complexity of edge structural information and the intensity distribution that varies from one retina image to another reduce segmentation performance. This paper proposes two novel deep learning-based modules, the channel attention vision transformer (CAViT) and deep adaptive gamma correction (DAGC), to tackle these issues. CAViT jointly applies efficient channel attention (ECA) and the vision transformer (ViT): the channel attention module models the interdependency among feature channels, while the ViT discriminates meaningful edge structures by considering the global context. The DAGC module predicts the optimal gamma correction value for each input image by jointly training a CNN with the segmentation network, so that all retina images are mapped to a unified intensity distribution. The experimental results show that the proposed method achieves superior performance compared to conventional methods on the widely used DRIVE and CHASE_DB1 datasets.