ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): Latest Publications

DP-DWA: Dual-Path Dynamic Weight Attention Network With Streaming DFSMN-SAN For Automatic Speech Recognition
Dongpeng Ma, Yiwen Wang, Liqiang He, Mingjie Jin, Dan Su, Dong Yu
DOI: 10.1109/icassp43922.2022.9746328 (published 2022-05-23)
Abstract: In multi-channel far-field automatic speech recognition (ASR) scenarios, the front end introduces distortion when it processes the speech signal, which degrades recognition performance. In this paper, we propose a dual-path network for the far-field acoustic model that takes the voice processing (VP) signal and the acoustic echo cancellation (AEC) signal as input. Specifically, we design a dynamic weight attention (DWA) module for combining the two signals. In addition, we streamline our best deep feed-forward sequential memory network with self-attention (DFSMN-SAN) acoustic model to meet real-time requirements. A joint-training strategy is adopted to optimize the proposed approach. With the dual-path network, we achieve a 54.5% relative improvement in character error rate (CER) on a 10,000-hour online conference task. Moreover, the proposed method is not affected by the arrangement of different microphone arrays: on a vehicle task with a two-microphone array, we achieve a 23.56% relative improvement. (An illustrative sketch of dynamic-weight fusion follows this entry.)
Citations: 1
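A minimal, illustrative sketch (not the authors' implementation) of the fusion idea named in the abstract: combining VP and AEC feature streams with per-frame attention weights learned jointly with the acoustic model. The tensor shapes, layer sizes, and module design are assumptions.

```python
import torch
import torch.nn as nn

class DynamicWeightFusion(nn.Module):
    """Hypothetical fusion block: softmax weights decide, per frame, how much of each path to keep."""
    def __init__(self, feat_dim: int):
        super().__init__()
        # Scores the two paths from their concatenated features.
        self.score = nn.Linear(2 * feat_dim, 2)

    def forward(self, vp_feats: torch.Tensor, aec_feats: torch.Tensor) -> torch.Tensor:
        # vp_feats, aec_feats: (batch, time, feat_dim)
        weights = torch.softmax(self.score(torch.cat([vp_feats, aec_feats], dim=-1)), dim=-1)
        return weights[..., 0:1] * vp_feats + weights[..., 1:2] * aec_feats

fused = DynamicWeightFusion(80)(torch.randn(4, 200, 80), torch.randn(4, 200, 80))
print(fused.shape)  # torch.Size([4, 200, 80])
```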
FRE-GAN 2: Fast and Efficient Frequency-Consistent Audio Synthesis
Sang-Hoon Lee, Ji-Hoon Kim, Kangeun Lee, Seong-Whan Lee
DOI: 10.1109/icassp43922.2022.9746675 (published 2022-05-23)
Abstract: Although recent advances in neural vocoders have brought significant improvements, most of these models trade audio quality against computational complexity. Since large models are impractical on low-resource devices, a more efficient neural vocoder should synthesize high-quality audio for practical applicability. In this paper, we present Fre-GAN 2, a fast and efficient high-quality audio synthesis model. For fast synthesis, Fre-GAN 2 synthesizes only the low- and high-frequency parts of the audio, and we leverage the inverse discrete wavelet transform to reproduce the target-resolution audio in the generator. Additionally, we introduce adversarial periodic feature distillation, which lets the model synthesize high-quality audio with only a small number of parameters. The experimental results show the superiority of Fre-GAN 2 in audio quality. Furthermore, Fre-GAN 2 achieves a 10.91× generation speedup, and its parameters are compressed by 21.23× compared with Fre-GAN. (An illustrative sketch of subband reconstruction with the inverse discrete wavelet transform follows this entry.)
Citations: 4
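A minimal sketch of the subband idea the abstract describes: splitting a signal into low- and high-frequency halves and recovering the full-resolution waveform with the inverse discrete wavelet transform. PyWavelets is used here only for clarity; in the paper the iDWT sits inside the GAN generator, and the wavelet choice below is an assumption.

```python
import numpy as np
import pywt

t = np.linspace(0, 1, 1024, endpoint=False)
audio = np.sin(2 * np.pi * 50 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)

# Analysis: split into an approximation (low-frequency) and a detail (high-frequency) subband.
low, high = pywt.dwt(audio, "db4")

# Synthesis: a generator that predicts only the two subbands needs just the iDWT
# to recover the target-resolution waveform.
reconstructed = pywt.idwt(low, high, "db4")
print(np.allclose(audio, reconstructed[: len(audio)]))  # True -- perfect reconstruction
```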
Local Context Interaction-Aware Glyph-Vectors for Chinese Sequence Tagging
Junyu Lu, Pingjian Zhang
DOI: 10.1109/icassp43922.2022.9747303 (published 2022-05-23)
Abstract: As ideographic characters, Chinese characters contain rich semantic and glyph information, which benefits sequence tagging tasks. However, it is difficult for shallow CNN architectures to extract glyph information from character data and to model the contextual interaction of different glyph information effectively. In this paper, we address these issues by presenting LCIN, a Local Context Interaction-aware Network for glyph-vector extraction. The network uses depthwise separable convolution and an attention mechanism to extract glyph information from images of Chinese characters. Moreover, we interconnect adjacent attention blocks so that glyph information can flow within the local context. Experiments on three sequence tagging subtasks show that our method outperforms other glyph-based models and achieves new state-of-the-art results on a wide range of datasets. (An illustrative sketch of a depthwise separable convolution block follows this entry.)
Citations: 1
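An illustrative sketch of the depthwise separable convolution named in the abstract, applied to a character image to produce glyph features. The channel sizes and kernel size are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2, groups=in_ch)
        # Pointwise: 1x1 convolution mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

glyph_image = torch.randn(1, 1, 32, 32)  # a single-channel character image (assumed size)
features = DepthwiseSeparableConv(1, 16)(glyph_image)
print(features.shape)  # torch.Size([1, 16, 32, 32])
```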
Detection of Covid-19 from Joint Time and Frequency Analysis of Speech, Breathing and Cough Audio
John Harvill, Yash R. Wani, Moitreya Chatterjee, M. Alam, D. Beiser, David Chestek, M. Hasegawa-Johnson, N. Ahuja
DOI: 10.1109/icassp43922.2022.9746015 (published 2022-05-23)
Abstract: The distinct cough sounds produced by a variety of respiratory diseases suggest the potential for a new class of audio biomarkers for the detection of COVID-19. Accurate audio-biomarker-based COVID-19 tests would be inexpensive, readily scalable, and non-invasive. Audio biomarker screening could also be used in resource-limited settings prior to traditional diagnostic testing. Here we explore the possibility of leveraging three audio modalities, cough, breathing, and speech, to determine COVID-19 status. We train a separate neural classification system on each modality, as well as a fused classification system on all three modalities together. Ablation studies are performed to understand the relationship between the individual and collective performance of the modalities. Additionally, we analyze the extent to which temporal and spectral features contribute to the COVID-19 status information contained in the audio signals. (An illustrative sketch of per-modality and fused classifiers follows this entry.)
Citations: 2
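An illustrative sketch of the setup the abstract describes: one classifier head per modality plus a fused head over all three. The embedding dimension and the concatenation-based fusion are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class FusedCovidClassifier(nn.Module):
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        # One binary head per modality embedding, plus one head on the concatenation.
        self.heads = nn.ModuleDict({m: nn.Linear(emb_dim, 1) for m in ("cough", "breathing", "speech")})
        self.fused_head = nn.Linear(3 * emb_dim, 1)

    def forward(self, embeddings: dict) -> dict:
        logits = {m: self.heads[m](embeddings[m]) for m in self.heads}
        logits["fused"] = self.fused_head(
            torch.cat([embeddings[m] for m in ("cough", "breathing", "speech")], dim=-1)
        )
        return logits

emb = {m: torch.randn(8, 128) for m in ("cough", "breathing", "speech")}  # placeholder embeddings
out = FusedCovidClassifier()(emb)
print({k: v.shape for k, v in out.items()})
```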
Learning Monocular 3D Human Pose Estimation With Skeletal Interpolation
Ziyi Chen, A. Sugimoto, S. Lai
DOI: 10.1109/icassp43922.2022.9746410 (published 2022-05-23)
Abstract: Deep learning has achieved unprecedented accuracy for monocular 3D human pose estimation. However, current learning-based 3D human pose estimation still suffers from poor generalization. Inspired by skeletal animation, which is popular in game development and animation production, we put forward a simple, intuitive, yet effective interpolation-based data augmentation approach that synthesizes continuous and diverse 3D human body sequences to enhance model generalization. The Transformer-based lifting network, trained with the augmented data, uses the self-attention mechanism to perform 2D-to-3D lifting and infers high-quality predictions in the qualitative experiment. The quantitative results of the cross-dataset experiment demonstrate that the resulting model achieves superior generalization accuracy on the publicly available dataset. (An illustrative sketch of pose interpolation follows this entry.)
Citations: 1
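An illustrative sketch of the interpolation-based augmentation idea: generating intermediate 3D skeletons between two key poses by linearly interpolating joint coordinates. The joint count and the linear scheme are assumptions; the paper's augmentation may operate differently.

```python
import numpy as np

def interpolate_poses(pose_a: np.ndarray, pose_b: np.ndarray, steps: int) -> np.ndarray:
    # pose_a, pose_b: (num_joints, 3) arrays of 3D joint positions.
    alphas = np.linspace(0.0, 1.0, steps)[:, None, None]
    return (1.0 - alphas) * pose_a + alphas * pose_b  # (steps, num_joints, 3)

pose_a = np.random.rand(17, 3)
pose_b = np.random.rand(17, 3)
sequence = interpolate_poses(pose_a, pose_b, steps=10)  # a short synthetic pose sequence
print(sequence.shape)  # (10, 17, 3)
```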
Regularization Using Denoising: Exact and Robust Signal Recovery
Ruturaj G. Gavaskar, K. Chaudhury
DOI: 10.1109/ICASSP43922.2022.9747396 (published 2022-05-23)
Abstract: We consider the problem of signal reconstruction from linearly corrupted data using plug-and-play (PnP) regularization. As opposed to traditional sparsity-promoting regularizers, PnP uses an off-the-shelf denoiser within a proximal algorithm such as ISTA or ADMM for image reconstruction. Although PnP has become popular in the imaging community, its regularization capacity is not fully understood. For example, it is not known whether PnP can in theory recover a signal from a few noiseless measurements, as in classical compressed sensing, and whether the recovery is robust. We explore these questions in this work and present some theoretical and experimental results. In particular, we prove that if the denoiser in question has low rank and the ground truth lies in the range of the denoiser, then it can be recovered exactly from noiseless measurements. To the best of our knowledge, this is the first such result. Furthermore, we show using numerical simulations that even if the aforementioned conditions are violated, PnP recovery is robust in practice. We formulate a theorem regarding the recovery error based on these observations. (An illustrative PnP-ISTA sketch follows this entry.)
Citations: 3
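A minimal sketch of a plug-and-play ISTA iteration for recovering x from measurements y = Ax: a gradient step on the data term followed by an off-the-shelf denoiser in place of the proximal operator. The denoiser here is a simple low-rank projection, chosen only to echo the low-rank condition in the abstract; the step size, dimensions, and iteration count are assumptions.

```python
import numpy as np

def low_rank_denoiser(x: np.ndarray, basis: np.ndarray) -> np.ndarray:
    # A linear, low-rank "denoiser": orthogonal projection onto the span of `basis`.
    return basis @ (basis.T @ x)

def pnp_ista(y: np.ndarray, A: np.ndarray, denoiser, iters: int = 500) -> np.ndarray:
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # step size from the spectral norm of A
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = denoiser(x - step * A.T @ (A @ x - y))  # gradient step, then denoiser as the "prox"
    return x

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((64, 4)))   # rank-4 range of the denoiser
x_true = basis @ rng.standard_normal(4)                 # ground truth lies in the denoiser's range
A = rng.standard_normal((16, 64))                       # few noiseless measurements (16 << 64)
x_hat = pnp_ista(A @ x_true, A, lambda x: low_rank_denoiser(x, basis))
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # near-zero relative error
```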
Towards Interpreting Deep Learning Models to Understand Loss of Speech Intelligibility in Speech Disorders Step 2: Contribution of the Emergence of Phonetic Traits
Sondes Abderrazek, C. Fredouille, A. Ghio, M. Lalain, Christine Meunier, V. Woisard
DOI: 10.1109/icassp43922.2022.9746198 (published 2022-05-23)
Abstract: Apart from the impressive performance deep learning has achieved on several tasks, one of the most important factors for its continued progress is the growing body of work on interpretability, especially in a medical context. In recent work, we presented the competitive performance of a CNN-based model trained on normal speech for French phone classification, and we showed how well it correlates with different perceptual measures when exposed to disordered speech. This paper extends that work by focusing on interpretability. The goal is to gain insight into the way neural representations shape the final phone classification task, so that this insight can then be used to explain the loss of intelligibility in disordered speech. We propose an original framework that relies, first, on the neural activity and a novel per-neuron representation for the phone classification task and, second, on identifying a set of neurons devoted to the detection of specific phonetic traits in normal speech. When faced with disordered speech, this set of neurons degrades, demonstrating a loss of specific phonetic traits in some of the patients involved and the potential of the proposed approach to provide information about speech alteration. (An illustrative sketch of the per-neuron phonetic-trait analysis follows this entry.)
Citations: 4
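An illustrative sketch of the kind of per-neuron analysis the abstract describes: averaging a layer's activations per phone class and flagging neurons whose activation concentrates on phones sharing a phonetic trait (nasality is used as an example). The activation source, phone set, and threshold are all assumptions; the paper's representation and criterion may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
phones = ["a", "i", "u", "m", "n", "s", "f"]
nasals = {"m", "n"}

# Hypothetical activations: (frames, neurons), with one phone label per frame.
activations = rng.random((1000, 32))
labels = rng.choice(phones, size=1000)

# Mean activation of each neuron for each phone class -> (phones, neurons) profile.
profile = np.stack([activations[labels == p].mean(axis=0) for p in phones])

# A neuron is considered "devoted" to the nasal trait if most of its activation
# mass falls on nasal phones (threshold chosen arbitrarily here).
nasal_mass = profile[[phones.index(p) for p in nasals]].sum(axis=0) / profile.sum(axis=0)
devoted_neurons = np.where(nasal_mass > 0.5)[0]
print(devoted_neurons)
```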
Distributed Particle Filters for State Tracking on the Stiefel Manifold Using Tangent Space Statistics
C. Bordin, Caio Gomes de Figueredo, Marcelo G. S. Bruno
DOI: 10.1109/icassp43922.2022.9746305 (published 2022-05-23)
Abstract: This paper introduces a novel distributed diffusion algorithm for tracking the state of a dynamic system that evolves on the Stiefel manifold. To compress the information exchanged between nodes, the algorithm builds a Gaussian parametric approximation to the particles, which are first projected onto the tangent space of the Stiefel manifold and mapped to real vectors. Observations from neighboring nodes are then assimilated under a general nonlinear observation model. Performance results are compared with those of competing linear diffusion extended Kalman filters and other particle filters. (An illustrative sketch of the tangent-space statistics follows this entry.)
Citations: 3
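An illustrative sketch of the tangent-space step the abstract describes: projecting particles onto the tangent space at a reference point on the Stiefel manifold, flattening them to real vectors, and fitting a Gaussian to those vectors. The projection formula is the standard one for the embedded Stiefel manifold; the dimensions and the choice of reference point are assumptions.

```python
import numpy as np

def project_to_tangent(X: np.ndarray, Z: np.ndarray) -> np.ndarray:
    # Tangent-space projection at X (with X.T @ X = I): Z - X * sym(X.T @ Z).
    XtZ = X.T @ Z
    return Z - X @ (XtZ + XtZ.T) / 2.0

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((5, 2)))          # reference point on St(5, 2)
particles = [rng.standard_normal((5, 2)) for _ in range(100)]

# Flattened tangent vectors and their Gaussian statistics (compact enough to exchange between nodes).
tangent_vecs = np.stack([project_to_tangent(X, P).ravel() for P in particles])
mean, cov = tangent_vecs.mean(axis=0), np.cov(tangent_vecs, rowvar=False)
print(mean.shape, cov.shape)  # (10,) (10, 10)
```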
Training Strategies for Automatic Song Writing: A Unified Framework Perspective
Tao Qian, Jiatong Shi, Shuai Guo, Peter Wu, Qin Jin
DOI: 10.1109/icassp43922.2022.9746818 (published 2022-05-23)
Abstract: Automatic song writing (ASW) typically involves four tasks: lyric-to-lyric generation, melody-to-melody generation, lyric-to-melody generation, and melody-to-lyric generation. Previous works have mainly focused on individual tasks without considering the correlation between them, and a unified framework that solves all four tasks has not yet been explored. In this paper, we propose a unified framework following the pre-training and fine-tuning paradigm that addresses all four ASW tasks with one model. To alleviate the scarcity of paired lyric-melody data for lyric-to-melody and melody-to-lyric generation, we adopt two pre-training stages with unpaired data. In addition, we introduce a dual transformation loss to make full use of paired data in the fine-tuning stage and to enforce the weak correlation between melody and lyrics. We also design an objective music generation evaluation metric involving the chromatic rule and a more realistic setting that removes some strict assumptions adopted in previous works. To the best of our knowledge, this work is the first to explore ASW for pop songs in Chinese. Extensive experiments demonstrate the effectiveness of the dual transformation loss and of the unified model structure encompassing all four tasks. The experimental results also show that our proposed evaluation metric aligns better with subjective opinion scores from human listeners. (An illustrative sketch of a dual-transformation-style training step follows this entry.)
Citations: 4
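An illustrative sketch only, showing one common reading of a "dual transformation" objective between a lyric-to-melody model and a melody-to-lyric model: each model's greedy output provides pseudo-parallel data for training the other direction. The toy models, vocabularies, and loss wiring below are assumptions; the paper's architecture and exact loss may differ.

```python
import torch
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    """Non-autoregressive stand-in: per-position classification from source tokens to target tokens."""
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(src_vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.emb(src))
        return self.out(h)  # (batch, time, tgt_vocab)

LYRIC_VOCAB, MELODY_VOCAB = 500, 100                       # assumed vocabulary sizes
lyric2melody = ToySeq2Seq(LYRIC_VOCAB, MELODY_VOCAB)
melody2lyric = ToySeq2Seq(MELODY_VOCAB, LYRIC_VOCAB)
ce = nn.CrossEntropyLoss()

melody = torch.randint(0, MELODY_VOCAB, (4, 32))           # an unpaired melody batch
with torch.no_grad():                                      # pseudo lyrics via greedy decoding
    pseudo_lyrics = melody2lyric(melody).argmax(dim=-1)
logits = lyric2melody(pseudo_lyrics)                        # try to reconstruct the original melody
dual_loss = ce(logits.reshape(-1, MELODY_VOCAB), melody.reshape(-1))
dual_loss.backward()
print(float(dual_loss))
```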
Vision Transformer-Based Retina Vessel Segmentation with Deep Adaptive Gamma Correction
Hyunwoo Yu, J. Shim, Jaeho Kwak, J. Song, Suk-Ju Kang
DOI: 10.1109/icassp43922.2022.9747597 (published 2022-05-23)
Abstract: Accurate segmentation of retina vessels is essential for the early diagnosis of eye-related diseases. Recently, convolutional neural networks have shown remarkable performance in retina vessel segmentation. However, the complexity of edge structural information and the intensity distribution that varies from one retina image to another reduce segmentation performance. This paper proposes two novel deep learning-based modules, a channel attention vision transformer (CAViT) and deep adaptive gamma correction (DAGC), to tackle these issues. CAViT jointly applies efficient channel attention (ECA) and the vision transformer (ViT): the channel attention module considers the interdependency among feature channels, and the ViT discriminates meaningful edge structures by considering the global context. The DAGC module provides the optimal gamma correction value for each input image by jointly training a CNN with the segmentation network, so that all retina images are mapped to a unified intensity distribution. The experimental results show that the proposed method achieves superior performance compared to conventional methods on the widely used DRIVE and CHASE_DB1 datasets. (An illustrative sketch of adaptive gamma correction follows this entry.)
Citations: 5
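An illustrative sketch of the adaptive gamma-correction idea: a small CNN predicts a per-image gamma, which is applied to the normalized fundus image before segmentation. The network size, the gamma range, and the input resolution are assumptions, not the paper's DAGC configuration.

```python
import torch
import torch.nn as nn

class AdaptiveGamma(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny regressor from the image to a single scalar.
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
        )

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (batch, 3, H, W), values in [0, 1]; predicted gamma kept in (0, 2) via a scaled sigmoid.
        gamma = 2.0 * torch.sigmoid(self.features(img)).view(-1, 1, 1, 1)
        return img.clamp(min=1e-6) ** gamma

corrected = AdaptiveGamma()(torch.rand(2, 3, 64, 64))
print(corrected.shape)  # torch.Size([2, 3, 64, 64])
```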