ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): Latest Papers

Convolutional Dropout and Wordpiece Augmentation for End-to-End Speech Recognition
Hainan Xu, Yinghui Huang, Yun Zhu, Kartik Audhkhasi, B. Ramabhadran
DOI: 10.1109/ICASSP39728.2021.9415004
Abstract: Regularization and data augmentation are crucial to training end-to-end automatic speech recognition systems. Dropout is a popular regularization technique, which operates on each neuron independently by multiplying it with a Bernoulli random variable. We propose a generalization of dropout, called "convolutional dropout", where each neuron’s activation is replaced with a randomly-weighted linear combination of neuron values in its neighborhood. We believe that this formulation combines the regularizing effect of dropout with the smoothing effects of the convolution operation. In addition to convolutional dropout, this paper also proposes using random word-piece segmentations as a data augmentation scheme during training, inspired by results in neural machine translation. We adopt both these methods during the training of transformer-transducer speech recognition models, and show consistent WER improvements on Librispeech as well as across different languages.
Citations: 3

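The abstract defines convolutional dropout only at a high level: each activation is replaced by a randomly-weighted combination of the values in its neighborhood. As a rough illustration under assumed details (uniform random weights normalized to sum to one, and a `radius` neighborhood parameter of our own invention, neither specified by the paper), a minimal NumPy sketch:

```python
import numpy as np

def convolutional_dropout(x, radius=1, rng=None):
    """Replace each activation with a randomly weighted combination of its
    neighbors -- an illustrative sketch of 'convolutional dropout'.

    x      : 1-D array of neuron activations for one layer
    radius : neighborhood half-width (hypothetical parameter, not from the paper)
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    out = np.empty(n, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        # Random non-negative weights over the neighborhood; a one-hot
        # weight vector would recover the original activation.
        w = rng.random(hi - lo)
        w /= w.sum()  # normalize so the overall scale is preserved
        out[i] = float(np.dot(w, x[lo:hi]))
    return out

acts = np.array([1.0, 2.0, 3.0, 4.0])
noisy = convolutional_dropout(acts)
print(noisy.shape)  # (4,)
```

Because each output is a convex combination of nearby activations, the perturbed values stay within the range of the original neighborhood, which is the "smoothing" contrast with ordinary dropout's hard zeroing.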
Sparsity in Max-Plus Algebra and Applications in Multivariate Convex Regression
Nikos Tsilivis, Anastasios Tsiamis, P. Maragos
DOI: 10.1109/ICASSP39728.2021.9414139
Abstract: In this paper, we study concepts of sparsity in the max-plus algebra and apply them to the problem of multivariate convex regression. We show how to efficiently find sparse (containing many −∞ elements) approximate solutions to max-plus equations by leveraging notions from submodular optimization. Subsequently, we propose a novel method for piecewise-linear surface fitting of convex multivariate functions, with optimality guarantees for the model parameters and an approximately minimum number of affine regions.
Citations: 2

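For background on the algebra involved: the max-plus product is (A ⊗ x)_i = max_j (A_ij + x_j), and the greatest subsolution ("principal solution") of A ⊗ x = b has the closed form x_j = min_i (b_i − A_ij). Sparse solutions in this setting set some coordinates to −∞, removing their columns from the max. A small NumPy sketch of these two operations (the paper's submodular selection of which coordinates to keep is not reproduced here):

```python
import numpy as np

def maxplus_prod(A, x):
    # Max-plus matrix-vector product: (A ⊗ x)_i = max_j (A_ij + x_j)
    return np.max(A + x[None, :], axis=1)

def principal_solution(A, b):
    # Greatest x satisfying A ⊗ x <= b: x_j = min_i (b_i - A_ij).
    # Setting any x_j to -inf keeps the subsolution property, which is
    # what makes sparse (-inf-rich) approximate solutions well-defined.
    return np.min(b[:, None] - A, axis=0)

A = np.array([[0.0, 2.0],
              [1.0, 0.0]])
b = np.array([3.0, 2.0])
x = principal_solution(A, b)
print(maxplus_prod(A, x))  # componentwise <= b; exact here
```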
Improving Identification of System-Directed Speech Utterances by Deep Learning of ASR-Based Word Embeddings and Confidence Metrics
Vilayphone Vilaysouk, Amr H. Nour-Eldin, Dermot Connolly
DOI: 10.1109/ICASSP39728.2021.9414330
Abstract: In this paper, we extend our previous work on the detection of system-directed speech utterances. This type of binary classification can be used by virtual assistants to create a more natural and fluid interaction between the system and the user. We explore two methods that both improve the Equal-Error-Rate (EER) performance of the previous model. The first exploits the supplementary information independently captured by ASR models through integrating ASR decoder-based features as additional inputs to the final classification stage of the model. This relatively improves EER performance by 13%. The second proposed method further integrates word embeddings into the architecture and, when combined with the first method, achieves a significant EER performance improvement of 48%, relative to that of the baseline.
Citations: 0

Image Coding with Neural Network-Based Colorization
Diogo Lopes, J. Ascenso, Catarina Brites, Fernando Pereira
DOI: 10.1109/ICASSP39728.2021.9413816
Abstract: Automatic colorization is a process with the objective of inferring the color of grayscale images. This process is frequently used for artistic purposes and to restore the color in old or damaged images. Motivated by the excellent results obtained with deep learning-based solutions in the area of automatic colorization, this paper proposes an image coding solution integrating a deep learning-based colorization process to estimate the chrominance components based on the decoded luminance, which is regularly encoded with a conventional image coding standard. In this case, the chrominance components are not coded and transmitted as usual, notably after some subsampling; only some color hints, i.e., chrominance values for specific pixel locations, may be sent to the decoder to help it create more accurate colorizations. To boost the colorization and final compression performance, intelligent ways to select the color hints are proposed. Experimental results show performance improvements with the increased level of intelligence in the color hints extraction process and a good subjective quality of the final decoded (and colorized) images.
Citations: 0

ATVIO: Attention Guided Visual-Inertial Odometry
Li Liu, Ge Li, Thomas H. Li
DOI: 10.1109/ICASSP39728.2021.9413912
Abstract: Visual-inertial odometry (VIO) aims to predict trajectory by ego-motion estimation. In recent years, end-to-end VIO has made great progress. However, how to handle visual and inertial measurements and make full use of the complementarity of cameras and inertial sensors remains a challenge. In this paper, we propose a novel attention guided deep framework for visual-inertial odometry (ATVIO) to improve the performance of VIO. Specifically, we concentrate on the effective utilization of the Inertial Measurement Unit (IMU) information. We carefully design a one-dimensional inertial feature encoder for IMU data processing, which can extract inertial features quickly and effectively. Meanwhile, we must prevent the inconsistency problem when fusing inertial and visual features. Hence, we explore a novel cross-domain channel attention block to combine the extracted features in a more adaptive manner. Extensive experiments demonstrate that our method achieves competitive performance against state-of-the-art VIO methods.
Citations: 8

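The cross-domain channel attention block is not specified in the abstract. As a generic illustration of channel attention in the squeeze-and-excitation style (a pooled descriptor gated through a small bottleneck MLP; all weight shapes here are hypothetical), one might write:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feats, W1, W2):
    """Squeeze-and-excitation-style channel attention: pool each channel,
    pass the pooled vector through a bottleneck MLP, and re-weight the
    channels. A generic sketch only -- the paper's cross-domain block,
    which fuses visual and inertial features, is more elaborate.

    feats : (channels, length) feature map
    W1    : (hidden, channels) squeeze weights   (hypothetical shapes)
    W2    : (channels, hidden) excite weights
    """
    squeezed = feats.mean(axis=1)  # global average pool per channel
    gates = sigmoid(W2 @ np.maximum(W1 @ squeezed, 0.0))  # gates in (0, 1)
    return feats * gates[:, None]  # channel-wise re-weighting

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))
W1 = rng.standard_normal((2, 8))
W2 = rng.standard_normal((8, 2))
out = channel_attention(feats, W1, W2)
print(out.shape)  # (8, 16)
```

Since every gate lies in (0, 1), the block can only attenuate channels, letting the network learn which modality's features to emphasize at fusion time.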
Segmental DTW: A Parallelizable Alternative to Dynamic Time Warping
T. Tsai
DOI: 10.1109/ICASSP39728.2021.9413827
Abstract: In this work we explore parallelizable alternatives to DTW for globally aligning two feature sequences. One of the main practical limitations of DTW is its quadratic computation and memory cost. Previous works have sought to reduce the computational cost in various ways, such as imposing bands in the cost matrix or using a multiresolution approach. In this work, we utilize the fact that computation is an abundant resource and focus instead on exploring alternatives that approximate the inherently sequential DTW algorithm with one that is parallelizable. We describe two variations of an algorithm called Segmental DTW, in which the global cost matrix is broken into smaller sub-matrices, subsequence DTW is performed on each sub-matrix, and the results are used to solve a segment-level dynamic programming problem that specifies a globally optimal alignment path. We evaluate the proposed alignment algorithms on an audio-audio alignment task using the Chopin Mazurka dataset, and we show that they closely match the performance of regular DTW. We further demonstrate that almost all of the computations in Segmental DTW are parallelizable, and that one of the variants is unilaterally better than the other for both empirical and theoretical reasons.
Citations: 2

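For reference, the sequential O(nm) dynamic program that Segmental DTW decomposes into independently solvable sub-matrices is the classic DTW recurrence:

```python
import numpy as np

def dtw_cost(x, y):
    """Classic O(len(x) * len(y)) DTW alignment cost between two 1-D
    sequences -- the sequential baseline that Segmental DTW breaks into
    sub-matrices which can be processed in parallel."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Each cell depends on three predecessor cells, which is what
            # makes the plain recurrence inherently sequential.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))  # 0.0 -- warping absorbs the repeat
```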
On the Detection of Pitch-Shifted Voice: Machines and Human Listeners
D. Looney, N. Gaubitch
DOI: 10.1109/ICASSP39728.2021.9414890
Abstract: We present a performance comparison between human listeners and a simple algorithm for the task of speech anomaly detection. The algorithm utilises an intentionally small set of features derived from the source-filter model, with the aim of validating that key components of source-filter theory characterise how humans perceive anomalies. We furthermore recognise that humans are adept at detecting anomalies without prior exposure to a given anomaly class. To that end, we also consider the algorithm performance when operating via the principle of unsupervised learning, where a null model is derived from normal speech recordings. We evaluate both the algorithm and human listeners on pitch-shift detection, where the pitch of a speech sample is intentionally modified using software, a phenomenon of relevance to the fields of fraud detection and forensics. Our results show that humans can only detect pitch-shift reliably at more extreme levels, and that the performance of the algorithm matches closely with that of humans.
Citations: 0

Multi-Task Estimation of Age and Cognitive Decline from Speech
Yilin Pan, Venkata Srikanth Nallanthighal, D. Blackburn, H. Christensen, Aki Härmä
DOI: 10.1109/ICASSP39728.2021.9414642
Abstract: Speech is a common physiological signal that can be affected by both ageing and cognitive decline. Often the effect can be confounding, as would be the case for people at, e.g., very early stages of cognitive decline due to dementia. Despite this, the automatic predictions of age and cognitive decline based on cues found in the speech signal are generally treated as two separate tasks. In this paper, multi-task learning is applied for the joint estimation of age and the Mini-Mental State Examination (MMSE) score commonly used to assess cognitive decline. To explore the relationship between age and MMSE, two neural network architectures are evaluated: a SincNet-based end-to-end architecture, and a system comprising a feature extractor followed by a shallow neural network. Both are trained with single-task or multi-task targets. For comparison, an SVM-based regressor is trained in a single-task setup. i-vector, x-vector and ComParE features are explored. Results are obtained on systems trained on the DementiaBank dataset and tested on an in-house dataset as well as the ADReSS dataset. The results show that both age and MMSE estimation are improved by applying multi-task learning, with state-of-the-art results achieved on the ADReSS acoustic-only task.
Citations: 5

GDTW: A Novel Differentiable DTW Loss for Time Series Tasks
Xiang Liu, Naiqi Li, Shutao Xia
DOI: 10.1109/ICASSP39728.2021.9413895
Abstract: Dynamic time warping (DTW) is one of the most successful methods that addresses the challenge of measuring the discrepancy between two series, which is robust to shift and distortion along the time axis of the sequence. Based on DTW, we propose a novel loss function for time series data called Gumbel-Softmin based fast DTW (GDTW). To the best of our knowledge, this is the first differentiable DTW loss for series data that scales linearly with the sequence length. The proposed Gumbel-Softmin replaces the simple minimization operator in DTW so as to better integrate the acceleration technology. We also design a deep learning model combining GDTW as a feature extractor. Thorough experiments over a broad range of time series analysis tasks were performed, showing the efficiency and effectiveness of our method.
Citations: 1

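The Gumbel-Softmin operator itself is not detailed in the abstract. To illustrate the general idea of replacing DTW's hard min with a smooth, differentiable surrogate, here is a log-sum-exp softmin variant (as used in soft-DTW-style losses; this is a stand-in, not the paper's Gumbel-based construction, and it retains the quadratic cost the paper accelerates):

```python
import numpy as np

def softmin(vals, gamma=1.0):
    # Smooth surrogate for min(): softmin(a) = -gamma * log(sum(exp(-a/gamma))).
    # As gamma -> 0 this approaches the hard minimum.
    a = np.asarray(vals, dtype=float) / -gamma
    m = a.max()
    return -gamma * (m + np.log(np.exp(a - m).sum()))

def soft_dtw_cost(x, y, gamma=1.0):
    """DTW recurrence with min replaced by softmin, making the alignment
    cost differentiable in the inputs. Illustrative only: GDTW's
    Gumbel-Softmin is a different (sampling-based) relaxation designed
    to scale linearly with sequence length."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + softmin(
                [D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]], gamma)
    return D[n, m]
```

With a small `gamma` the soft cost tracks the hard DTW cost closely, while larger values trade fidelity for smoother gradients.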
CMIM: Cross-Modal Information Maximization For Medical Imaging
Tristan Sylvain, Francis Dutil, T. Berthier, Lisa Di-Jorio, M. Luck, R. Devon Hjelm, Y. Bengio
DOI: 10.1109/ICASSP39728.2021.9414132
Abstract: In hospitals, data are siloed in specific information systems that make the same information available under different modalities, such as the different medical imaging exams the patient undergoes (CT scans, MRI, PET, Ultrasound, etc.) and their associated radiology reports. This offers unique opportunities to obtain and use at train-time those multiple views of the same information that might not always be available at test-time. In this paper, we propose an innovative framework that makes the most of available data by learning good representations of a multi-modal input that are resilient to modality dropping at test-time, using recent advances in mutual information maximization. By maximizing cross-modal information at train time, we are able to outperform several state-of-the-art baselines in two different settings, medical image classification and segmentation. In particular, our method is shown to have a strong impact on the inference-time performance of weaker modalities.
Citations: 3

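A common way to "maximize cross-modal information" between paired modalities is an InfoNCE-style contrastive bound, where matched pairs in a batch are positives and all other pairings are negatives; whether CMIM uses exactly this estimator is not stated in the abstract. A minimal NumPy sketch of the loss:

```python
import numpy as np

def infonce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style lower bound on cross-modal mutual information.
    Row i of z_a and row i of z_b are embeddings of the same sample
    under two modalities (the positive pair); other rows act as
    negatives. Illustrative sketch, not the paper's exact estimator.

    z_a, z_b : (batch, dim) embeddings of the two modalities
    """
    a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature               # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Minimizing this drives matched pairs to be more similar than
    # mismatched ones, i.e., it maximizes the contrastive MI bound.
    return -np.mean(np.diag(log_probs))
```

Training the per-modality encoders to minimize this loss makes their representations predictive of each other, which is what allows a weaker modality to benefit at inference time even when the stronger one is dropped.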